The widespread use and popularity of collaborative content sites (e.g., IMDB, Amazon, Yelp, etc.) have created
rich resources for users to consult in order to make purchasing decisions on various products such as movies,
e-commerce products, restaurants, etc. Products with desirable tags (e.g., modern, reliable, etc.) have higher
chances of being selected by prospective customers. This creates an opportunity for product designers to
design better products that are likely to attract desirable tags when published. In this paper, we investigate
how to mine collaborative tagging data to decide the attribute values of new products and to return the top-k
products that are likely to attract the maximum number of desirable tags when published. Of course, real-world
product design is a complex task, and tag desirability is only one, albeit novel, aspect of the design
considerations. The motivation is that the returned set of k products can assist product designers, who can
then select from among them using additional constraints such as price, profitability, etc. Given a training
set of existing products with their features and user-submitted tags, we first build a Naive Bayes Classifier
for each tag. We show that the problem is NP-complete even if simple Naive Bayes Classifiers are used
for tag prediction. We present a suite of algorithms for solving this problem: (a) an exact two-tier algorithm
(based on top-k querying techniques), which performs much better than the naive brute-force algorithm and
works well for moderate problem instances, and (b) a set of approximation algorithms for larger problem
instances: a novel polynomial-time approximation algorithm with provable error bound and a practical hill-climbing
heuristic. We conduct detailed experiments on synthetic and real data crawled from the web to
evaluate the efficiency and quality of our proposed algorithms, as well as show how product designers can
benefit by leveraging collaborative tagging information.

Categories and Subject Descriptors: H.4 [Information Systems Applications]: Miscellaneous 

General Terms: Algorithms, Performance 

Additional Key Words and Phrases: collaborative tagging, product design, naive bayes, optimization

1. INTRODUCTION 

Motivation: The widespread use and popularity of online collaborative content sites
have created rich resources for users to consult in order to make purchasing decisions
on various products such as movies, e-commerce products, restaurants, etc. Various 
websites today (e.g., Amazon for e-commerce products, Flickr for photos, YouTube for 
videos) encourage users to actively participate by assigning labels or tags to online
resources, with the purpose of promoting their content and allowing users to share, discover
and organize them. An increasing number of people are turning to online ratings, re- 
views and user-specified tags to choose from among competing products. Products with 
desirable tags (e.g., modern, reliable, etc.) have a higher chance of being selected by
prospective customers. This creates an opportunity for product designers to design 
better products that are likely to attract desirable tags when published. In addition 
to traditional marketplaces like electronics, autos or apparel, tag desirability also extends
to other diverse domains. For example, music websites such as Last.fm use social
tags to guide their listeners in browsing through artists and music. An artist creating 
a new musical piece can leverage the tags that users have selected, in order to select 
the piece's attributes (e.g. acoustic and audio features) that will increase its chances of 
becoming popular. Similarly, a blogger can select a topic based on the tags that other 
popular topics have received. 

Our paper investigates this novel tag maximization problem, i.e., how to leverage 
collaborative tagging information to decide the attribute values of new products and to 
return the top-k products that are likely to attract the maximum number of desirable
tags when published. We provide more details as follows. 

Tag Maximization Problem: Assume we are given a training dataset of objects (i.e.,
products), each having a set of well-defined features (i.e., attributes) and a set of 
user-submitted tags (e.g., cell phones on Amazon's website, each described by a set 
of attributes such as display size, Operating System and associated user tags such as 
lightweight, easy to use). From this training data, for each distinct tag, we assume 
a classifier has been constructed for predicting the tag given the attributes. Tag prediction
is a recent area of research (see Section 7 for discussion of related work), and
the existence of such classifiers is a key assumption in our work. In addition to the 
product's explicitly specified attributes, other implicit factors also influence tagging 
behavior, such as the perceived utility and product quality to the user, the tagging be- 
havior of the user's friends, etc. However, pure content-based tag prediction approaches 
are often quite effective - e.g., in the context of laptops, attributes such as smaller di- 
mensions and the absence of a built-in DVD drive may attract tags such as portable. 
Given a query consisting of a subset of tags that are considered desirable, our task 
is to suggest a new product (i.e., a combination of attribute values) such that the ex- 
pected number of desirable tags for this potential product is maximized. This can be 
extended to the top-k version, where the task is to suggest the k potential products
with the highest expected number of desirable tags. In addition to the set of desirable 
tags, our problem can also consider a set of undesirable tags, e.g. unreliable. The op- 
timization goal in this case is to maximize the number of desirable tags and minimize 
the undesirable ones - a simple combination function is to optimize the expected num- 
ber of desirable tags minus the expected number of undesirable tags. In our discussion 
so far, we have not explained how the sets of desirable and undesirable tags are created.
Although this is not the focus of this paper, we mention several ways in which
this can be done. For example, domain experts could study the set of tags and mark 
them accordingly. Automated methods may involve leveraging the user rating or the 
sentiment of the user review to classify tags as desirable, undesirable or unimportant. 

Novelty, Technical Challenges and Approaches: The dynamics of social tagging
have been an active research area in recent years. However, related literature primarily
focuses on the problems of tag prediction, including cold-start recommendation to
facilitate web-based activities. To the best of our knowledge, tags have not been studied in
the context of product design before. Of course, real-world product design is a complex
task, and is an area that has been heavily studied in economics, marketing, industrial 
engineering and more recently in computer science. Many factors like the cost and re- 
turn on investment are currently considered. We argue that the user feedback (in the 
form of tags of existing competing products) should be taken into consideration in the 
design process, especially since online user tagging is extremely widespread and offers 

ACM Transactions on Knowledge Discovery from Data, Vol. , No. , Article , Publication date: . 



Top-K Product Design Based on Collaborative Tagging Data 3 

unprecedented opportunities for understanding the collective opinion and preferences 
of a huge consumer base. We envision user tags to be one of the several factors in 
product design that can be used in conjunction with more traditional factors - e.g., our 
algorithms return k potential new products that maximize the number of desirable
tags; and this information can assist content producers, who can then further post- 
process the returned results using additional constraints such as profitability, price, 
resource constraints, product diversity, etc. Moreover, product designers can explore 
the data in an interactive manner by picking and choosing different sets of desirable 
tags to get insight on how to build new products that target different user populations 
— e.g., in the context of cell phones, tags such as lightweight and powerful target 
professionals, whereas tags such as cheap, cool target younger users. 

Solving the tag maximization problem is technically challenging. In most product 
bases, complex dependencies exist among the tags and products, and it is difficult to 
determine a combination of attribute values that maximizes the expected number of 
desirable tags. In this paper we consider the very popular Naive Bayes Classifier for 
tag prediction¹. Extending our work to other popular classifiers is one of our future research
directions. As one of our first results, we show that even for this classifier (with
its simplistic assumption of conditional independence), the tag maximization problem 
is NP-Complete. Given this intractability result, it is important to develop algorithms 
that work well in practice. A highlight of our paper is that we have avoided resort- 
ing to heuristics, and instead have developed principled algorithms that are practical 
and at the same time possess compelling theoretical characteristics. We also mention 
a practical heuristic that works very well for real-world instances. 

Our first algorithm is a novel exact top-k algorithm ETT (Exact Two-Tier Top-k algorithm)
that performs significantly better than the naive brute-force algorithm (which
simply builds all possible products and determines the best ones) for moderate problem
instances. Our algorithm is based on nontrivial adaptations of top-k query processing
techniques (e.g., [Fagin et al. 2001]), but is not merely a simple extension of
the Threshold Algorithm (TA). The complexity arises because the problem involves maximizing a sum of terms,
where within each term there is a product of quantities which are interdependent with
the quantities from the other terms. Our top-k algorithm has an interesting two-tier
architecture. At the bottom tier, we develop a sub-system for each distinct tag,
such that each sub-system has the ability to compute on demand a stream of products 
in order of decreasing probability of attracting the corresponding tag, without hav- 
ing to pre-compute all possible products in advance. In effect, each sub-system simu- 
lates sorted access efficiently. This is achieved by partitioning the set of attributes into 
smaller groups (thus, each group represents a partial product), and running a separate 
merge algorithm over all the groups. The top tier considers the products retrieved from 
each sub-system in a round-robin manner, computes the expected number of desirable 
tags for each retrieved product, and stops when a threshold condition is reached. Al- 
though in the worst case this algorithm can take exponential time, for many datasets 
with strong correlations between attributes and tags, the stopping condition is reached 
much earlier. 

However, although the exact algorithm performs well for moderate problem sizes, 
it does not scale easily to larger, real-world-sized datasets, and thus we also develop
several approximation algorithms for solving the problem. Designing approximation 
algorithms with guaranteed behavior is challenging, since no known approximation 



1 Naive Bayes Classifiers are often effective, rival the performance of more sophisticated classifiers,
and are known to perform well in social network applications. For instance, Pak and
Paroubek [Pak and Paroubek 2010] show that Naive Bayes performs better than SVM and CRF in classifying
the sentiment of blogs.
algorithm for other NP-Complete problems can be easily modified for our case. Our
exact algorithm ETT can be modified to serve as an approximation algorithm - we 
can change the threshold condition such that the algorithm stops when the threshold 
is within a small user-provided approximation factor of the top-k product scores produced
thus far. This algorithm can guarantee an approximation factor in the quality
of products returned, but would run in exponential time in the worst case. Our first 
approximation algorithm PA (Poly-Time Approximation algorithm) runs in worst case 
polynomial time, and also guarantees a provable bound on the approximation factor 
in product quality. The principal idea is to group the desirable tags into constant-sized
groups, find the top-k products for each sub-group, and output the overall top-k products
from among these computed products. Interestingly, we note that in this algorithm
we create sub-problems by grouping tags; in contrast in our exact algorithm we create 
sub-problems (i.e., subsystems) by grouping attributes. For each sub-problem thus cre- 
ated, we show that it can be solved by a polynomial time approximation scheme (PTAS) 
given any user-defined approximation factor. The algorithm's overall running time is 
exponential only in the (constant) size of the groups, thus giving overall a polynomial 
time complexity. 

Our second approximation algorithm is a more practical hill climbing heuristic HC. 
It starts with a randomly generated product (starts at the base of the hill), and then 
repeatedly improves the solution by changing some attribute values (walks up the hill) 
until no further small change improves the product (reaches a local maximum).
This algorithm can be improved by repeating with random restarts. Multiple (k or
more) random restarts may also lead us to multiple locally optimal products, of
which the top-k may be returned. This algorithm works well in practice, as shown empirically
in Section 6. However, although we propose this as a viable, efficient solution to the problem
for handling large real datasets, it does not guarantee any sort of worst case behavior,
either in running time or in product quality. In fact, we prove that there exist datasets
for which the expected number of tags of a globally optimum product can be exponen- 
tially larger than that of a locally optimum product. 
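
The hill-climbing idea described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the paper's HC implementation: it assumes boolean attributes, it takes a "small change" to mean flipping a single attribute, and the names `hill_climb`, `hill_climb_top_k`, and `toy_score` are hypothetical. Any function that scores a product (for example, the expected number of desirable tags) can be passed as `score`.

```python
import random

def hill_climb(score, m, rng):
    """Climb from a random boolean product using single-attribute flips (assumed neighbourhood)."""
    current = tuple(rng.randint(0, 1) for _ in range(m))
    while True:
        # All products that differ from the current one in exactly one attribute.
        neighbours = [current[:i] + (1 - current[i],) + current[i + 1:]
                      for i in range(m)]
        best = max(neighbours, key=score)
        if score(best) <= score(current):
            return current  # no single flip improves the score: local maximum
        current = best

def hill_climb_top_k(score, m, k, restarts, seed=0):
    """Run several random restarts and keep the k best distinct local maxima."""
    rng = random.Random(seed)
    optima = {hill_climb(score, m, rng) for _ in range(restarts)}
    return sorted(optima, key=score, reverse=True)[:k]

def toy_score(p):
    # Hypothetical objective with two local maxima, (1, 1, 1) and (0, 0, 1).
    return 2 * p[0] * p[1] + (1 - p[0]) * (1 - p[1]) + p[2]

print(hill_climb_top_k(toy_score, m=3, k=2, restarts=10))
```

The toy objective has a local maximum at (0, 0, 1) that is worse than the global maximum at (1, 1, 1), which is the kind of situation in which a single run can stop early and random restarts help.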

We experiment with synthetic as well as real datasets crawled from the web to compare
our algorithms. A user study and a case study on the real dataset demonstrate that products
suggested by our algorithms appear to be meaningful. With regard to efficiency,
the exact algorithm performs well on moderate problem instances, whereas the approximation
algorithms scale very well for larger datasets.

Summary of Contributions: We make the following main contributions. 

— We introduce the novel problem of top-k product design based on user-submitted tags
and show that this problem is NP-complete, even if tag prediction is modeled using 
simple Naive Bayes Classifiers. 

— We develop an exact algorithm ETT to compute the top-k best products that works
well for moderate problem instances. 

— We also present a set of approximation algorithms for larger problem instances: HC, 
empirically shown to work extremely well for real datasets; and PA based on a poly- 
nomial time approximation scheme (PTAS), with provable error bounds. 

— We perform detailed experiments on synthetic and real datasets crawled from the 
web to demonstrate the effectiveness of our developed algorithms. 

2. PROBLEM FRAMEWORK 

Let $D = \{o_1, o_2, \ldots, o_n\}$ be a collection of $n$ products, where each product entry is defined
over the attribute set $A = \{A_1, A_2, \ldots, A_m\}$ and the tag dictionary space $T = \{T_1, T_2, \ldots, T_r\}$.
Each attribute $A_i$ can take one of several values $a_i$ from a multi-valued categorical
domain $D_i$, or one of two values 0, 1 if a boolean dataset is considered. The attribute
set $A$ can be a mix of categorical and boolean attributes too. A tag $T_j$ is a bit where
a 0 implies the absence of a tag and a 1 implies the presence of a tag for product $o$.
Each product is thus a vector of size $(m + r)$, where the first $m$ positions correspond to
a vector of attribute values, and the next $r$ positions correspond to a boolean vector.

Example 2.1. Consider a camera dataset with n = 3 rows, m = 4 attributes and
r = 3 tags, where each tuple represents a camera. The categorical attributes are Brand,
Type, etc., and the boolean attributes are Auto Focus, Image Stabilizer, etc. Suppose there
are three user-submitted tags, namely lightweight, user-friendly, and excellent
quality. The value of a tag column is 1 if the camera entry tuple has been annotated
by this tag, and 0 otherwise. An example of such a camera dataset, having a mix of
categorical and boolean attributes, is shown in Table I. A camera manufacturing company
may investigate such a training set to learn the correlation between attributes
and tags and design new camera(s) with the best attribute values so that it generates
maximum positive response from the customers. □

Table I. Sample camera training set of boolean and categorical attributes, as well as user-submitted tags

      Attribute                                          Tag
ID    Brand   Type      Auto Focus   Image Stabilizer    lightweight   user-friendly   excellent quality
1     Nikon   SLR       1            1                   0             0               1
2     Canon   Compact   1            1                   1             1               0
3     Sony    Compact   1            0                   1             0               0


We assume such a dataset has been used as a training set to build Naive Bayes 
Classifiers (NBC), that classify tags given attribute values (one classifier per tag). The 
classifier for tag Tj defines the probability that a new product o is annotated by tag Tj : 



$$\Pr(T_j \mid o) = \Pr(T_j \mid a_1, a_2, \ldots, a_m) = \frac{\Pr(T_j) \prod_{i=1}^{m} \Pr(a_i \mid T_j)}{\Pr(a_1, a_2, \ldots, a_m)} \qquad (1)$$



where $a_i$ is the value of $o$ for attribute $A_i$, $a_i \in D_i$. The probabilities $\Pr(a_i \mid T_j)$ are
computed using the dataset. In particular, $\Pr(a_i \mid T_j)$ is the proportion³ of products
tagged by $T_j$ that have $A_i = a_i$. $\Pr(T_j)$ is the proportion of products in the dataset that
have $T_j$.

Similarly, we compute the probability $\Pr(T_j' \mid o)$ of a product $o$ not having tag $T_j$:

$$\Pr(T_j' \mid o) = \frac{\Pr(T_j') \prod_{i=1}^{m} \Pr(a_i \mid T_j')}{\Pr(a_1, a_2, \ldots, a_m)} \qquad (2)$$

We know that $\Pr(T_j \mid o) + \Pr(T_j' \mid o) = 1$; hence from Equations 1 and 2 we get

$$\Pr(a_1, a_2, \ldots, a_m) = \Pr(T_j) \prod_{i=1}^{m} \Pr(a_i \mid T_j) + \Pr(T_j') \prod_{i=1}^{m} \Pr(a_i \mid T_j') \qquad (3)$$

From Equations 1 and 3,

$$\Pr(T_j \mid o) = \Pr(T_j \mid a_1, a_2, \ldots, a_m) = \frac{\Pr(T_j) \prod_{i=1}^{m} \Pr(a_i \mid T_j)}{\Pr(T_j) \prod_{i=1}^{m} \Pr(a_i \mid T_j) + \Pr(T_j') \prod_{i=1}^{m} \Pr(a_i \mid T_j')} = \frac{1}{1 + \frac{\Pr(T_j') \prod_{i=1}^{m} \Pr(a_i \mid T_j')}{\Pr(T_j) \prod_{i=1}^{m} \Pr(a_i \mid T_j)}} \qquad (4)$$

For convenience we use the notation

$$R_j = \frac{\Pr(T_j') \prod_{i=1}^{m} \Pr(a_i \mid T_j')}{\Pr(T_j) \prod_{i=1}^{m} \Pr(a_i \mid T_j)}$$

2 Our framework allows numeric attributes, but as is common with Naive Bayes Classifiers, we assume that
they have been appropriately binned into discrete ranges.

3 The observed probabilities are smoothed using the Bayesian m-estimate method [Cestnik 1990]. We note
that more sophisticated Bayesian methods that use an informative prior may be employed instead.



Consider a query which picks a set of desirable tags $T_d = \{T_1, \ldots, T_z\} \subseteq T$.
The expected number of desirable tags $T_j \in T_d$ that a new product $o$, characterized
by $(a_1, a_2, \ldots, a_m) \in A$, is annotated with is given by:

$$E(o, T_d) = \sum_{j=1}^{z} \frac{1}{1 + R_j} \qquad (5)$$

We are now ready to formally define the main problem. 

TAG MAXIMIZATION PROBLEM: Given a dataset of tagged products $D = \{o_1, o_2, \ldots, o_n\}$
and a query $T_d$, design $k$ new products that have the highest expected number
of desirable tags they are likely to receive, given by Equation 5.

For the rest of the paper, we consider boolean attributes so that for attribute $A_i$, its
value $a_i$ is either 0 or 1. We explain our algorithms in a boolean framework, which
can be readily generalized to handle categorical data. We also assume that all tags are
of equal weight; if tags are of varying importance, Equation 5 can be re-written as a
weighted sum, and all our proposed algorithms can be modified accordingly.
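
To make the objective concrete, the following Python sketch evaluates the expected number of desirable tags of Equation 5 for one boolean product. It is a minimal illustration under assumptions: each per-tag classifier is represented as a hypothetical tuple `(prior, cond_pos, cond_neg)` holding $\Pr(T_j)$, $\Pr(a_i = v \mid T_j)$ and $\Pr(a_i = v \mid T_j')$, the probabilities are assumed to be already smoothed so that no denominator is zero, and the function names are not taken from the paper.

```python
from math import prod

def odds_ratio(product, prior, cond_pos, cond_neg):
    """R_j for one tag: the ratio in the denominator of Equation 4.

    prior           -- Pr(T_j)
    cond_pos[i][v]  -- Pr(a_i = v | T_j)
    cond_neg[i][v]  -- Pr(a_i = v | T_j')
    """
    numerator = (1.0 - prior) * prod(cond_neg[i][v] for i, v in enumerate(product))
    denominator = prior * prod(cond_pos[i][v] for i, v in enumerate(product))
    return numerator / denominator  # assumes smoothed, non-zero probabilities

def expected_tags(product, classifiers):
    """E(o, T_d): sum over the desirable tags of 1 / (1 + R_j), as in Equation 5."""
    return sum(1.0 / (1.0 + odds_ratio(product, *c)) for c in classifiers)

# Hypothetical classifier for a single tag over two boolean attributes.
toy_classifier = (0.5, [[0.2, 0.8], [0.5, 0.5]], [[0.8, 0.2], [0.5, 0.5]])
print(expected_tags((1, 0), [toy_classifier]))
```

In this toy classifier the first attribute is strongly associated with the tag and the second is uninformative, so setting the first attribute to 1 raises the expected tag count.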

We now analyze the computational complexity of the main problem and then propose 
our algorithmic solutions. 

3. COMPLEXITY ANALYSIS 

In this section, we analyze the computational complexity of the main problem. Clearly,
the brute-force exhaustive search requires us to design all $2^m$ possible
products and compute $E(o, T_d)$ for each of them. This naive approach runs in
exponential time. We next give a proof sketch that the Tag Maximization
problem is NP-Complete, which leads us to believe that in the worst case we may not
be able to do much better than the naive approach.

THEOREM 3.1. The Tag Maximization problem is NP-Complete even for boolean 
datasets and for k = 1. 

Proof: The membership of the decision version of the problem in NP is obvious. To
verify NP-hardness, we reduce the 3SAT problem to the decision version of our problem.
We first reduce the 3SAT problem to the minimization version of the optimization
problem, represented as $E_{min}(o, T_d)$, and then reduce $E_{min}(o, T_d)$ to $E(A, T_d)$.

Reduction of 3SAT to decision version of $E_{min}(o, T_d)$:

3SAT is the popular NP-complete boolean satisfiability problem in computational
complexity theory, an instance of which concerns a boolean expression in conjunctive
normal form, where each clause contains exactly 3 literals. Each clause $C_j$ is mapped
to a tag $T_j$ in the instance of $E_{min}(o, T_d)$ and each variable $x_i$ is mapped to attribute
value $a_i$. We make the following assignments so that if there is a boolean assignment
vector $a = [a_1, \ldots, a_m]$ that satisfies 3SAT, then $E_{min}(o, T_d)$ equals zero (and if $a$ does
not satisfy 3SAT, then $E_{min}(o, T_d)$ has a non-zero sum).

— For a variable $x_i$ specified as a positive literal in 3SAT, set $\Pr(a_i = 0 \mid T_j) = 1$

— For a variable $x_i$ specified as a negative literal in 3SAT, set $\Pr(a_i = 1 \mid T_j) = 1$
— For a particular clause and for the unspecified attributes (variables), set $\Pr(a_i = 0 \mid T_j) = \Pr(a_i = 1 \mid T_j) = 1/2$

For example, consider the 3SAT instance $(\neg x_1 \vee x_2 \vee \neg x_3) \wedge (x_1 \vee \neg x_2 \vee \neg x_4)$. For each
tag, we create two products. For the first clause (that corresponds to the first tag), $x_1$
(that corresponds to $A_1$) is negative and hence for both the first and second product it
is $A_1 = 1$. $x_4$ is missing from the first clause; hence for the first product it is $A_4 = 0$
and for the second it is $A_4 = 1$. Similarly, the assignments of the second clause (that is
the second tag) can be explained. Again, an assignment $A_1 = 1, A_2 = 1, A_3 = 0, A_4 = 0$
satisfying the 3SAT instance has $E_{min}(o, T_d) = 0$.

Table II. Table of attributes and tags

Attributes                Tags
A_1   A_2   A_3   A_4     T_1   T_2
1     0     1     0       1     0
1     0     1     1       1     0
0     1     0     1       0     1
0     1     1     1       0     1


Reduction of $E_{min}(A, T_d)$ to $E(A, T_d)$:

If we have a boolean assignment vector $a = [a_1, \ldots, a_m]$ that minimizes the expected
number of tags being present, we have the corresponding $\Pr(T_j' \mid a_1, a_2, \ldots, a_m)$. Hence,
we get $\Pr(T_j \mid a_1, a_2, \ldots, a_m) = 1 - \Pr(T_j' \mid a_1, a_2, \ldots, a_m)$ that maximizes the expected
number of tags being present. □

Section 4 and Section 5 next describe our algorithmic solutions to this NP-Complete
problem in a boolean framework. 

4. EXACT ALGORITHMS 

A brute-force exhaustive approach (henceforth, referred to as Naive) to solve the Tag
Maximization problem requires us to design all $2^m$ possible products and
compute $E(o, T_d)$ for each possible product. Note that the number of products in the
dataset is not important for the execution cost, since an initialization step can calcu- 
late all the conditional tag-attribute probabilities by a single scan of the dataset. The 
Naive approach will clearly run in exponential time, and the NP-completeness proof 
leads us to believe that in the worst case we may not be able to do much better. Al- 
though general purpose pruning-based optimization techniques (such as branch-and- 
bound algorithms) can be used to solve the problem more efficiently than Naive, such 
approaches are only limited to constructing the top-1 product, and it is not clear how 
they can be easily extended for k > 1. 
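
As a point of reference for the algorithms that follow, the Naive approach can be sketched as a short Python function. This is an illustrative sketch for boolean attributes only; `naive_top_k` is a hypothetical name, and `score` stands for any function that evaluates a complete product, such as $E(o, T_d)$ of Equation 5.

```python
import heapq
from itertools import product as cartesian

def naive_top_k(m, k, score):
    """Enumerate all 2^m boolean products and return the k best (score, product) pairs."""
    candidates = ((score(p), p) for p in cartesian((0, 1), repeat=m))
    # nlargest keeps only k candidates in memory, but every product is still scored once.
    return heapq.nlargest(k, candidates)

# Toy usage: the score simply counts the attributes set to 1.
print(naive_top_k(3, 2, sum))
```

The running time grows as $2^m$ regardless of $k$, which is why this enumeration is only workable for a small number of attributes.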

In the following subsection, we propose a novel exact algorithm for any k based 
on interesting and nontrivial adaptations of top-k query processing techniques. This 
algorithm is shown in practice to explore far fewer product candidates than Naive, 
and works well for moderate problem instances. 

4.1. Exact Two-Tier Top-k Algorithm 

We develop an exact two-tier top-k algorithm (ETT) for the Tag Maximization problem.
For simplicity, henceforth we refer to desirable tags as just tags. The main idea of ETT
is to determine the best products for each individual tag in tier-1 and then match these
products in tier-2 to compute the globally best products (across all tags). Both tiers
use pipelined techniques to minimize the number of accesses, as shown in Figure 1.
The output of tier-1 is $z$ unbounded buffers (one for each tag) of complete products,
ordered by decreasing probability for the corresponding tag. These buffers are not fully
materialized, but may be considered as sub-systems that can be accessed on demand
in a pipelined manner.

In tier-2, the top products from the $z$ buffers are combined in a pipelined
manner to produce the global top-k products, akin to the Threshold Algorithm
(TA) [Fagin et al. 2001]. In turn, tier-2 makes GetNext() requests (see Figure 1) to various
buffers in tier-1 in round-robin manner. In tier-1, for each specific tag, we partition
the set of attributes into subsets, and for each subset of attributes we precompute a
list of all possible partial attribute value assignments, ordered by their score for the
specific tag (the score will be defined later). The partial products are then scanned and
joined, leveraging results from Rank-Join algorithms [Ilyas et al. 2009] that support
top-k ranked join queries in relational databases, in order to feed information to tier-2.
A single GetNext() for a specific tag may translate to multiple retrievals from the
lists of partial products in tier-1, which are then joined into complete products and
returned.



[Figure 1: diagram with the labels "Final Result", "Top-k", GetNext() calls, and per-tag partial product lists $L_{j1}, L_{j2}, \ldots$ under each tag $T_j$]

Fig. 1. Two-Tier Top-K Algorithm Framework



4.1.1. Tier-1. Suppose we partition the $m$ attributes into $l$ subsets, where each
subset has $m' = m/l$ attributes, as follows: $\{a_1, \ldots, a_{m'}\}$, $\{a_{m'+1}, \ldots, a_{2m'}\}$, ...,
$\{a_{m-m'+1}, \ldots, a_m\}$. We create partial product lists $L_{j1}, \ldots, L_{jl}$ for each tag $T_j$. Each
list $L_{ji}$ has $2^{m'}$ entries (partial products). Consider the first list $L_{j1}$. The score of a
partial product $o_p \in L_{j1}$ with attribute values $a_1, \ldots, a_{m'}$ for $T_j$ is

$$E_{partial}(o_p, T_j) = \sqrt[l]{P_j} \, \prod_{i=1}^{m'} \frac{\Pr(a_i \mid T_j')}{\Pr(a_i \mid T_j)} \qquad (6)$$

where $P_j = \frac{\Pr(T_j')}{\Pr(T_j)}$. Note that the $l$-th root of $P_j$ is used in order to distribute the effect
of $P_j$ from Equation 4 to the $l$ lists, such that when they are combined using multiplication,
we get $P_j$.
Lists $L_{ji}$ are ordered by descending $1/E_{partial}$, since $R_j$ appears in the denominator of
Equation 5. The $l$ lists are accessed in round-robin fashion and for every combination
of partial products from the lists, we join them to build a complete product and resolve
its exact score by Equation 1.
A product is returned as a result of GetNext() to tier-2 if its score is higher than the
MPFS (Maximum Possible Future Score), which is the upper bound on the score of an
unseen product. To compute MPFS, we assume that the current entry from a list is
joined with the top entries from all other lists:

$$MPFS = \frac{1}{1 + \max\big( (s_{j1} \cdot h_{j2} \cdots h_{jl}),\ (h_{j1} \cdot s_{j2} \cdots h_{jl}),\ \ldots,\ (h_{j1} \cdot h_{j2} \cdots s_{jl}) \big)} \qquad (7)$$

where $s_{ji}$ and $h_{ji}$ are the last seen and top entries from list $L_{ji}$ respectively.
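
The construction of the partial product lists can be illustrated with a short Python sketch. It is a sketch under assumptions rather than the ETT implementation: attributes are boolean, groups are formed from consecutive attributes, probabilities are assumed smoothed (non-zero), and the names `partial_lists`, `toy_prior`, `toy_pos`, and `toy_neg` are hypothetical. Each entry is scored as in Equation 6, so multiplying one entry from every list gives the ratio $R_j$ of the joined product.

```python
from itertools import product as cartesian
from math import prod

def partial_lists(prior, cond_pos, cond_neg, group_size):
    """Build the tier-1 lists L_j1..L_jl for one tag T_j (boolean attributes).

    prior           -- Pr(T_j)
    cond_pos[i][v]  -- Pr(a_i = v | T_j)
    cond_neg[i][v]  -- Pr(a_i = v | T_j')
    Each list entry is a pair (E_partial, attribute values of the group).
    """
    m = len(cond_pos)
    groups = [range(s, min(s + group_size, m)) for s in range(0, m, group_size)]
    root = ((1.0 - prior) / prior) ** (1.0 / len(groups))  # l-th root of P_j
    lists = []
    for group in groups:
        entries = [
            (root * prod(cond_neg[i][v] / cond_pos[i][v]
                         for i, v in zip(group, values)), values)
            for values in cartesian((0, 1), repeat=len(group))
        ]
        entries.sort()  # ascending E_partial, i.e. descending 1 / E_partial
        lists.append(entries)
    return lists

# Hypothetical probabilities for one tag over four boolean attributes.
toy_prior = 0.4
toy_pos = [[0.3, 0.7], [0.6, 0.4], [0.5, 0.5], [0.2, 0.8]]
toy_neg = [[0.7, 0.3], [0.4, 0.6], [0.5, 0.5], [0.8, 0.2]]
toy_lists = partial_lists(toy_prior, toy_pos, toy_neg, group_size=2)

# Joining the head of every list gives the most probable complete product for this tag.
best_product = toy_lists[0][0][1] + toy_lists[1][0][1]
print(best_product)
```

With two groups of two attributes, each list holds four partial products instead of the sixteen complete products, which is what lets tier-1 produce candidates on demand.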

4.1.2. Tier-2. In this tier, the $z$ unbounded buffers, one for each tag, are combined using
the summation function, as shown in Equation 5. Each product from one buffer
matches exactly one entry (the identical product) from each of the other buffers. Products
are retrieved from each buffer using GetNext() operations, and once a product is retrieved we
directly compute its score for all other tags by running each Naive Bayes classifier, without using
the process of tier-1. A product is output if its score is higher than the threshold,
which is the sum of the last seen scores from all $z$ buffers. A bounded buffer with the $k$
best results so far is maintained. On termination, this buffer is returned as the top-k
products.
The pseudocode of ETT is shown in Algorithm 1.

Table III. Example products data set

      Attribute                 Tag
ID    A_1   A_2   A_3   A_4     T_1   T_2
1     0     0     0     1       0     0
2     0     1     0     0       0     1
3     ...
Fig. 2. Iteration 1: Exact Two-Tier Top-K Algorithm for Example in Table III
Algorithm 1 ETT (Naive Bayes probabilities, attributes per group m', k): top-k
exact products

// Main Algorithm
Top-k-Buffer ← {}
for j = 1 to z do
    B_j ← {}  // unbounded buffer of candidate result products per tag
    for i = 1 to l do
        s_ji, h_ji ← top entry from list L_ji
    end for
end for
Call Threshold()

// Method Threshold() - Tier-2
while true do
    for j = 1 to z do
        (o_j, score_j(o_j)) ← GetNext(j)
        ExactScore(o_j) ← Compute for o_j by Equation 5
    end for
    Update Top-k-Buffer with new products if necessary
    MinK ← lowest score in Top-k-Buffer
    α ← Σ_j score_j(o_j)  // Threshold
    if MinK > α then
        return top-k products
    end if
end while

// Method GetNext(j): (o_j, score_j(o_j)) - Tier-1
while true do
    Compute MPFS by Equation 7
    // score_j(o) for product o is defined as 1/(1 + R_j) (R_j defined by Equation 4)
    if B_j has a product o with score_j(o) > MPFS then
        return (o, score_j(o)) and remove it from B_j
    end if
    Retrieve next entry o_p from a list L_ji in round robin and advance s_ji
    Join o_p with all combinations of partial products from other lists and create all products NewProducts
    Add NewProducts to buffer B_j of candidate result products
end while

Example 4.1. Consider the boolean dataset of 10 objects, each entry having 4
attributes and 2 tags, in Table III. We partition the 4 attributes into groups of 2
attributes: $(A_1, A_2)$ form list $L_{j1}$ and $(A_3, A_4)$ form list $L_{j2}$. We run NBC and calculate
all conditional tag-attribute probabilities. The algorithm framework for the running
example is presented in Figure 2. Lists $L_{11}$ and $L_{12}$ under tag $T_1$ are sorted in decreasing
order of $1/E_{partial}$, given by Equation 6 (similarly for $L_{21}$ and $L_{22}$ under tag $T_2$). The step-by-step
operations of ETT for retrieving the top-1 product for this example are shown
below:

(1) [ITERATION 1] Call to ThresholdO in tier-2 calls GetNextO for 7\ and T 2 respec- 
tively in tier-1. 
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— GetNextCTJ returns (1010,0.95) to tier-2: Join-1 builds product 1010, whose 
scorei(1010)=0.95 and MPFS(1010)=0.95. Since score 1 > MPFS, 1010 is re- 
turned. 

— GetNext(T 2 ) returns (1111,0.93) to tier-2. 

— Threshold*): ExactScore(1010)=1.70, ExactScore(llll)=1.75. 

— Bounded Buffer: 1111; MinK=1.75, a=1.88 

— MinK < a, continue. 

(2) [ITERATION 2] Threshold*) in tier-2 calls GetNextO for T ± and T 2 respectively in 
tier-1. 

— GetNextCrj returns (1011,0.92) to tier-2. 

— GetNext(T 2 ) returns (1110,0.88) to tier-2. 

— Threshold*): ExactScore(1011)=1.76, ExactScore(1110)=1.77. 

— Bounded Buffer: 1110; MinK=1.77, a=1.79 

— MinK < a, continue. 

(3) [ITERATION 3] Threshold*) in tier-2 calls GetNextO for T x and T 2 respectively in 
tier-1. 

— GetNext*^) returns (0010,0.89) to tier-2. 

— GetNext(T 2 ) returns (0111,0.84) to tier-2. 

— ThresholdO: ExactScore(0010)=1.76, ExactScore(0111)=1.77. 

— Bounded Buffer: 1110; MinK=1.77, a=1.74 

— MinK > a, return 1110 and terminate. 

Thus, ETT returns the best product by just looking up 6 products, instead of 16 prod- 
ucts (as in Naive algorithm). □ 
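The tier-2 stopping rule used in this example can be sketched in a few lines of Python. This is a minimal illustration for k = 1, not the paper's implementation: the function name `threshold_top1`, the pre-materialised `streams` (standing in for successive GetNext() calls) and the `exact` dictionary are assumptions made for the sketch, and the numbers are the ones quoted in the example. The threshold is computed here as the sum of the rounded per-tag scores, so it can differ in the last digit from the printed values.

```python
def threshold_top1(streams, exact_score):
    """Tier-2 threshold loop for k = 1.

    streams: one list per tag of (product, per-tag score) pairs, in the
             order tier-1 GetNext() would return them (assumed input).
    exact_score: dict mapping product -> ExactScore (assumed precomputed).
    """
    best, best_score, looked_up = None, float("-inf"), 0
    for round_entries in zip(*streams):
        for product, _ in round_entries:
            looked_up += 1
            if exact_score[product] > best_score:
                best, best_score = product, exact_score[product]
        # Threshold: sum of the per-tag scores returned in this round.
        alpha = sum(score for _, score in round_entries)
        if best_score > alpha:
            return best, best_score, looked_up
    return best, best_score, looked_up  # streams exhausted


streams = [
    [("1010", 0.95), ("1011", 0.92), ("0010", 0.89)],  # tag T1
    [("1111", 0.93), ("1110", 0.88), ("0111", 0.84)],  # tag T2
]
exact = {"1010": 1.70, "1111": 1.75, "1011": 1.76,
         "1110": 1.77, "0010": 1.76, "0111": 1.77}
result = threshold_top1(streams, exact)
```

On these inputs the loop stops in the third round, after six exact-score lookups, with product 1110.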

Grouping of Attributes: The ETT algorithm partitions the set of attributes into smaller groups (each group representing a partial product), which we join to retrieve the best product to feed to tier-2. We can employ state-of-the-art techniques to create a graph, where each node corresponds to an attribute and an edge between two attributes is weighted by the absolute value of the correlation between them, and then apply graph clustering techniques to partition the attributes into as many groups as the desired number of lists. If the attributes within each group are highly correlated, such a grouping makes our ETT algorithm reach the stopping condition earlier than it would if the attributes were grouped arbitrarily.
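As a rough illustration of correlation-based grouping, the sketch below weights each attribute pair by the absolute Pearson correlation and then forms groups greedily. The greedy pass is a deliberately simple stand-in, not a real graph-clustering algorithm, and the helper names `pearson` and `group_attributes` as well as the toy `rows` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov / sqrt(vx * vy)


def group_attributes(rows, group_size):
    """Greedy stand-in for graph clustering: seed a group with the first
    ungrouped attribute, then add the ungrouped attributes whose absolute
    correlation with the seed is largest."""
    m = len(rows[0])
    cols = [[row[i] for row in rows] for i in range(m)]
    weight = {frozenset(p): abs(pearson(cols[p[0]], cols[p[1]]))
              for p in combinations(range(m), 2)}
    ungrouped, groups = list(range(m)), []
    while ungrouped:
        seed = ungrouped.pop(0)
        ranked = sorted(ungrouped, key=lambda a: -weight[frozenset((seed, a))])
        members = ranked[:group_size - 1]
        for a in members:
            ungrouped.remove(a)
        groups.append(sorted([seed] + members))
    return groups


# Toy data: attributes 0 and 1 are identical, 2 and 3 are only weakly related.
rows = [
    (1, 1, 0, 1),
    (0, 0, 1, 0),
    (1, 1, 1, 0),
    (0, 0, 0, 1),
    (1, 1, 0, 0),
    (0, 0, 1, 1),
]
groups = group_attributes(rows, group_size=2)
```

Using the absolute value means strongly anti-correlated attributes also end up in the same group, which matches the edge-weight definition above.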

5. APPROXIMATION ALGORITHM 

The exact algorithm of Section 4.1 is feasible only for moderate instances of the Tag Maximization problem. For larger problem instances, it is necessary to use approximation algorithms and/or heuristics to solve the problem. In this section we discuss two such algorithms: (a) an approximation algorithm that provides guarantees on both the quality of the top-k results and the running time (guaranteed polynomial time); and (b) an efficient heuristic that provides no guarantee on either the quality of the products returned or the running time; however, this algorithm is of practical interest and is empirically shown to perform well.

5.1. Poly-Time Approximation Algorithm 

Our first algorithm (PA, or polynomial time approximation algorithm) is an approximation algorithm with provable error and time bounds. The main idea is to group the desirable tags into constant-sized groups of $z'$ tags each, find the top-k products for each subgroup, and output the overall top-k products from among these computed products.⁴ For each sub-problem thus created, we show that it can be solved by a polynomial time approximation scheme (PTAS) [Garey and Johnson 1990], i.e., can be solved in polynomial time given any user-defined approximation factor $\epsilon$ (a function of the compression factor $\sigma$ and $m$; details later in Theorem 5.2). The overall running time of the algorithm is exponential only in the (constant) size of the groups, thus giving an overall polynomial time complexity.

Algorithm 2 PA (Naive Bayes probabilities, tags per group $z'$, compression factor $\sigma$): top-1 approximate product in polynomial time

// Main Algorithm
1: Partition tags $T$ into $z/z'$ groups $T_1, \ldots, T_{z/z'}$
2: for $r = 1$ to $z/z'$ do
3:   $o_r \leftarrow$ PTAS($T_r$)
4:   Compute ExactScore($o_r$) by Equation 5
5: end for
6: return $o_r$ with max ExactScore

// Method PTAS($T_r$): $o$
1: $S'_0 \leftarrow \{0^m\}$ // boolean vector of size $m$ with all 0's
2: for $i = 1$ to $m$ do
3:   $S_i = S'_{i-1} \cup S^{1}_{i-1}$ // $S^{1}_{i-1}$: $S'_{i-1}$ with the $i$-th attribute value set to 1
4:   // Compress $S_i$ to $S'_i$ using compression factor $\sigma$
5:   $S'_i \leftarrow \{\}$
6:   repeat
7:     $o \leftarrow$ representative product in $S_i$
8:     $S'_i \leftarrow S'_i \cup \{o\}$
9:     Delete from $S_i$ all products $o'$ such that $\forall T_j \in T_r$, $|E(o, \{T_j\}) - E(o', \{T_j\})| \le \sigma E(o, \{T_j\})$
10:   until $S_i$ is empty
11: end for
12: return product $o$ in $S'_m$ with largest $E(o, T_r)$

We now consider a sub-problem consisting of only a constant number of tags, $z'$. We also restrict our discussion to the case $k = 1$ (more general values of $k$ are discussed later). We shall design a polynomial time approximation scheme (PTAS) for this sub-problem. A PTAS is defined as follows. Let $\epsilon > 0$ be any user-defined parameter. Given any instance of the sub-problem, let PTAS return the product $o_a$. Let the optimal product be $o_g$. The PTAS should run in polynomial time, and $ExactScore(o_a) \ge (1 - \epsilon)\,ExactScore(o_g)$.

⁴ Interestingly, we note that in this algorithm we create ($z/z'$) sub-problems by grouping tags; in contrast, in our exact ETT algorithm we create sub-problems (i.e., subsystems) by grouping attributes.

In describing the PTAS, we first discuss a simple exponential time exact top-1 algorithm for the sub-problem, and then show how it can be modified into the PTAS. Given $m$ boolean attributes and $z'$ tags, the exponential time algorithm makes $m$ iterations as follows: As an initial step, it produces the set $S_0^u$ consisting of the single product $\{0^m\}$ along with its $z'$ scores, one for each tag. In the first iteration, it produces the set containing two products, $S_1^u = \{0^m, 10^{m-1}\}$, each accompanied by its $z'$ scores, one for each tag. More generally, in the $i$-th iteration, it produces the set of products $S_i^u = \{\{0, 1\}^i \times 0^{m-i}\}$ along with their $z'$ scores, one for each tag. Each set can be derived from the set computed in the previous iteration. Once $m$ iterations have been completed, the final set $S_m^u$ contains all $2^m$ products along with their exact scores, from which the top-1 product can be returned, namely the product for which the sum of the $z'$ scores is highest. However, this algorithm takes exponential time, as the sets double in size in each iteration.
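The doubling behaviour of the exponential algorithm can be sketched as follows. The sketch assumes a per-tag score of the form $1/(1 + R)$, with $R$ the product of a prior ratio and per-attribute likelihood ratios; the function names and the two toy tags are hypothetical and are not taken from the paper's dataset.

```python
def tag_score(product, prior_ratio, ratios):
    """Per-tag Naive Bayes score 1 / (1 + R) for one boolean product.

    prior_ratio: Pr(T') / Pr(T) for the tag (T' means the tag is absent).
    ratios[i][v]: Pr(a_i = v | T') / Pr(a_i = v | T); hypothetical inputs.
    """
    r = prior_ratio
    for i, v in enumerate(product):
        r *= ratios[i][v]
    return 1.0 / (1.0 + r)


def exponential_top1(m, tags):
    """Enumerate all 2^m products by doubling the set once per attribute."""
    current = [(0,) * m]                      # initial set {0^m}
    for i in range(m):
        flipped = [p[:i] + (1,) + p[i + 1:] for p in current]
        current = current + flipped           # the set doubles in size

    def exact(p):
        return sum(tag_score(p, prior, ratios) for prior, ratios in tags)

    return max(current, key=exact), len(current)


# Two hypothetical tags over m = 3 attributes.
tags = [
    (1.0, [(2.0, 0.5), (1.0, 1.0), (0.5, 2.0)]),
    (1.0, [(2.0, 0.5), (0.5, 2.0), (1.0, 1.0)]),
]
best, size = exponential_top1(3, tags)
```

The returned size is $2^m$, which is the cost that the compression step of the PTAS is designed to avoid.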

The main idea of the PTAS is to not allow the sets to become exponential in size. This is done by compressing each set $S_i$ produced in an iteration (which has the same form as $S_i^u$, with $S_i \subseteq S_i^u$) into a smaller set $S'_i$, so that the sets remain polynomial in size. Each product in $S_i$ can be viewed as a point in a $z'$-dimensional space, whose $z'$ co-ordinates correspond to the product's scores for the $z'$ individual tags, by Equation 5. Essentially, we use a clustering algorithm in $z'$-dimensional space. For each cluster, we pick a representative product that stands for all other products in the cluster, which are thereby deleted. The clustering has to be done carefully so as to guarantee that, for the products that are deleted, the representative product's exact score is close to the deleted product's exact score. Thus, when the top-1 product of the final compressed set $S'_m$ is returned, its exact score should not be too different from the exact score of the top-1 product assuming no compression was done.
The pseudocode of PA is shown in Algorithm 2.

Example 5.1. We execute PA on the example in Table III without any grouping of tags (i.e., $z = z' = 2$, $z/z' = 1$), so that the execution of PA is equivalent to the execution of PTAS. We also execute the exponential time top-1 algorithm (henceforth referred to as Exponential) from which the PTAS is adapted. Let the compression factor $\sigma$ be 0.5. We start with $S'_0 = \{0000\}$. The step-by-step operations of PA (as well as Exponential) for retrieving the top-1 product are shown below:

(1) [ITERATION 1]
  - $S_1 = \{0000, 1000\}$, with two-dimensional co-ordinates (0.31, 0.20) and (0.51, 0.38) respectively. For Exponential, $S_1^u = \{0000, 1000\}$ too.
  - Compress $S_1$ and get $S'_1 = \{1000\}$. 1000 is the representative product of 0000.

(2) [ITERATION 2]
  - $S_2 = \{1000, 1100\}$, with two-dimensional co-ordinates (0.51, 0.38) and (0.31, 0.58) respectively. For Exponential, $S_2^u = \{0000, 1000, 0100, 1100\}$ from $S_1^u$, i.e., $2^2 = 4$ products with two-dimensional co-ordinates (0.31, 0.20), (0.51, 0.38), (0.16, 0.38) and (0.31, 0.58) respectively.
  - Compress $S_2$ and get $S'_2 = \{1000, 1100\}$. In other words, no compression is possible for the $\sigma$ under consideration.

(3) [ITERATION 3]
  - $S_3 = \{1000, 1100, 1010, 1110\}$, built from $S'_2$, with two-dimensional co-ordinates (0.51, 0.38), (0.31, 0.58), (0.95, 0.75) and (0.89, 0.88) respectively. For Exponential, $S_3^u = \{0000, 1000, 0100, 1100, 0010, 1010, 0110, 1110\}$ from $S_2^u$, i.e., $2^3 = 8$ products.
  - Compress $S_3$ and get $S'_3 = \{1000, 1100, 1110\}$. 1110 is the representative product of 1010.

(4) [ITERATION 4]
  - $S_4 = \{1000, 1100, 1110, 1001, 1101, 1111\}$, with two-dimensional co-ordinates (0.51, 0.38), (0.31, 0.58), (0.89, 0.88), (0.37, 0.52), (0.20, 0.72) and (0.82, 0.93). For Exponential, $S_4^u$ has all 16 products, as we see in Figure 3.
  - Compress $S_4$ and get $S'_4 = \{1000, 1100, 1111\}$.

Fig. 3. Compression in PA Algorithm (left column) vs. Exponential time Algorithm (right column) for Example Dataset of Two Tags in Table III

The top-1 approximate product is 1111 with score 1.75, while the optimal product is 1110 with score 1.77. Figure 3 shows the compression in the four iterations. The boolean products in underlined red font are the cluster representatives. □
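A single compression pass can be sketched as below, using the co-ordinates quoted in the example. One point is an assumption rather than something the text fixes: the sketch picks as representative the remaining product with the highest total score. The function name `compress` is likewise hypothetical.

```python
def compress(points, sigma):
    """One compression pass over a set S_i.

    points: dict product -> tuple of per-tag scores (its co-ordinates).
    A product is dropped when, for every tag, its score lies within
    sigma * (representative's score) of the representative's score.
    Assumption: the representative is the remaining product with the
    highest total score.
    """
    remaining = dict(points)
    kept = []
    while remaining:
        rep = max(remaining, key=lambda p: sum(remaining[p]))
        rep_xy = remaining.pop(rep)
        kept.append(rep)
        for other in list(remaining):
            xy = remaining[other]
            if all(abs(r - x) <= sigma * r for r, x in zip(rep_xy, xy)):
                del remaining[other]
    return kept


s1 = {"0000": (0.31, 0.20), "1000": (0.51, 0.38)}
s3 = {"1000": (0.51, 0.38), "1100": (0.31, 0.58),
      "1010": (0.95, 0.75), "1110": (0.89, 0.88)}
kept1 = compress(s1, 0.5)
kept3 = compress(s3, 0.5)
```

With this representative rule the sketch yields the same compressed sets as iterations 1 and 3 of the example. A different rule can retain a different but equally valid set, as in iteration 4, where the example keeps 1111 rather than 1110.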

THEOREM 5.2. Given a user-defined approximation factor $\epsilon$, a constant-sized group $T_r$ of $z'$ tags, and for $k = 1$, if we set the compression factor $\sigma = \epsilon/2m$, then:

(1) For every product $o$ in the uncompressed set $S_m^u$, there is a product $o'$ in the compressed set $S'_m$ for which $E(o, T_r) \le (1 + \sigma)^m E(o', T_r)$

(2) The output of PTAS($T_r$) has an exact score that is at least $\frac{1}{1+\epsilon}$ times the exact score of the optimal product

Proof of Part 1: Let $o_i^u$ indicate a product belonging to the uncompressed set $S_i^u = \{\{0, 1\}^i \times 0^{m-i}\}$ in the $i$-th iteration, where $S_i^u$ has all $2^i$ products. Let $o_i$ indicate a product belonging to the set $S_i$, which has the same form as $S_i^u$, $S_i \subseteq S_i^u$. Let $o'_i$ indicate a product in the compressed set $S'_i$, $S'_i \subseteq S_i$. Note that in the $i$-th iteration, $S_i$ is built from products in the compressed set of the $(i-1)$-th iteration, $S'_{i-1}$, while $S'_i$ is built by compressing $S_i$. Intuitively, the idea is: for a single tag $T_j$, if the scores of two products $o_i^u$ and $o'_i$ in $S_i$ are close to each other (so that $o_i^u$ is represented by $o'_i$ in $S'_i$, and $o_i^u$ does not exist in $S'_i$), then the scores of products $o_{i+1}^u$ and $o'_{i+1}$ are also close to each other, where $o_{i+1}^u$ is $o_i^u$ and $o'_{i+1}$ is $o'_i$ with the $(i+1)$-th bit flipped. In Example 5.1, product 0000 in the uncompressed set $S_1^u$ is represented by product 1000 in $S'_1$ since they are close to each other; therefore, product 0100 in the uncompressed set $S_2^u$ (but not in $S_2$, because 0000 was removed from $S'_1$) must be close to some product belonging to $S_2$ (which happens to be 1100 in our example).

More formally, we need to show that for a tag $T_j$, if $\Delta_1 = \frac{E(o'_i, T_j)}{E(o_i^u, T_j)} \le (1 + \epsilon_1)$, then $\Delta_2 = \frac{E(o'_{i+1}, T_j)}{E(o_{i+1}^u, T_j)} \le (1 + \epsilon_2) \le P(n)(1 + \epsilon_1)$, where $P(n)$ is a polynomial in $n$.

The score of product $o_i^u$ in the uncompressed set $S_i^u$ for tag $T_j$ from Equation 5 is:

\[ E(o_i^u, T_j) = \frac{1}{1 + \frac{\Pr(T_j')}{\Pr(T_j)} \prod_{k=1}^{m} \frac{\Pr(a_k \mid T_j')}{\Pr(a_k \mid T_j)}} = \frac{1}{1 + PQ} \quad \text{(say)} \]

where $P = \frac{\Pr(T_j')}{\Pr(T_j)} \prod_{k=1}^{i} \frac{\Pr(a_k \mid T_j')}{\Pr(a_k \mid T_j)}$ and $Q = \prod_{k=i+1}^{m} \frac{\Pr(a_k \mid T_j')}{\Pr(a_k \mid T_j)}$ are proportional to the product of probabilities for the first $i$ attributes in $o_i^u$ and the remaining $(m - i)$ attributes in $o_i^u$ respectively. Similarly, the scores of products $o'_i$, $o_{i+1}^u$ and $o'_{i+1}$ for tag $T_j$ are:

\[ E(o'_i, T_j) = \frac{1}{1 + P'Q}, \qquad E(o_{i+1}^u, T_j) = \frac{1}{1 + PQ'}, \qquad E(o'_{i+1}, T_j) = \frac{1}{1 + P'Q'} \]

where $P'$ and $Q'$ are proportional to the product of probabilities for the first $i$ attributes in $o'_i$ and the remaining $(m - i)$ attributes in $o_{i+1}^u$, $o'_{i+1}$, in which the $(i+1)$-th attribute value is flipped from $o_i^u$.

Assume $E(o_i^u, T_j)$ is close to $E(o'_i, T_j)$, so that $o'_i$ represents $o_i^u$, and that $P' < P$, so that the product of probabilities decreases (i.e., the score of the product increases) when the $i$-th attribute value is flipped. The difference in exact score between $o_i^u$ and $o'_i$ can be expressed as:

\[ \Delta_1 = \frac{E(o'_i, T_j)}{E(o_i^u, T_j)} = \frac{1 + PQ}{1 + P'Q} = \frac{1 + a}{1 + b} \le (1 + \epsilon_1) \quad \text{(say)} \tag{8} \]

The relationship between $E(o_{i+1}^u, T_j)$ and $E(o'_{i+1}, T_j)$ can be similarly expressed as:

\[ \Delta_2 = \frac{E(o'_{i+1}, T_j)}{E(o_{i+1}^u, T_j)} = \frac{1 + PQ'}{1 + P'Q'} = \frac{1 + PQ\frac{Q'}{Q}}{1 + P'Q\frac{Q'}{Q}} = \frac{1 + ac}{1 + bc} \le (1 + \epsilon_2) \quad \text{(say), where } c = \frac{Q'}{Q} \tag{9} \]

If $\epsilon_2 \le P(n)\epsilon_1$, where $P(n)$ is a polynomial of $n$, then the proof is complete. From Equation 8 we get $1 + a \le (1 + \epsilon_1)(1 + b)$, and therefore:

\[ \frac{1 + ac}{1 + bc} \le 1 + \epsilon_1 \frac{c(1 + b)}{1 + bc} \]

Now, $a \in [\frac{1}{n^m}, n^m]$, $b \in [\frac{1}{n^m}, n^m]$, $c \in [\frac{1}{n^2}, n^2]$, so that:

\[ \frac{1 + b}{1 + bc} \sim \frac{1 + n^m}{1 + n^{m+2}} \]

Therefore, $\epsilon_2 \le P(n)\epsilon_1$ and $\deg P(n) = 2$. The above proof for a tag $T_j$ can be readily extended to a set of tags $T_r$. Hence in our algorithm PTAS($T_r$), each skipped product has at least one representative product retained in the compressed set.

Proof of Part 2: Consider any tag group $T_r$, and let $o^{OPT}$ be the optimal product for this group, and $o^{APP}$ be the product returned by PTAS. From Part 1, for every product $o$ in the set $S_m^u$ (assuming no compression was used in any iteration), there is a product $o'$ in the compressed set $S'_m$ that satisfies

\[ E(o, T_r) \le (1 + \sigma)^m E(o', T_r) \tag{10} \]

In particular, since $o^{APP}$ is the highest-scoring product in $S'_m$, the following holds:

\[ E(o^{OPT}, T_r) \le (1 + \sigma)^m E(o^{APP}, T_r) \tag{11} \]

Since $\sigma = \epsilon/2m$, we get:

\[ E(o^{OPT}, T_r) \le \left(1 + \frac{\epsilon}{2m}\right)^m E(o^{APP}, T_r) \le e^{\epsilon/2} E(o^{APP}, T_r) \le (1 + \epsilon) E(o^{APP}, T_r) \tag{12} \]

Therefore, the output $o^{APP}$ of PTAS($T_r$) has an exact score that is at least $\frac{1}{1+\epsilon}$ times the exact score of the optimal product $o^{OPT}$.
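The chain of inequalities in Equation 12 can be checked numerically. The snippet below is only a sanity check over a few sample values, restricted to $0 < \epsilon \le 1$ (a range in which $e^{\epsilon/2} \le 1 + \epsilon$ holds); the function name `bound_chain` and the sampled values are illustrative assumptions.

```python
import math


def bound_chain(eps, m):
    """Return the three quantities compared in Equation 12."""
    sigma = eps / (2 * m)          # compression factor sigma = eps / 2m
    return (1 + sigma) ** m, math.exp(eps / 2), 1 + eps


checks = [bound_chain(eps, m)
          for eps in (0.01, 0.1, 0.5, 1.0)
          for m in (1, 4, 16, 64)]
chain_holds = all(a <= b <= c for a, b, c in checks)
```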
THEOREM 5.3. Given a user-defined approximation factor $\epsilon$, a non-constant number of tags $z$ grouped into $z/z'$ groups of $z'$ tags per group, and for $k = 1$, if we set the compression factor $\sigma = \epsilon/2m$, then:

(1) The output of PA has an exact score that is at least $\frac{z'}{z(1+\epsilon)}$ times the exact score of the optimal product

(2) PA runs in polynomial time

Proof of Part 1: The analysis in Theorem 5.2 is for a single tag group $T_r$ having a constant number of tags $z'$. Since there are $z/z'$ tag groups in total, it is easy to see that this introduces an additional factor of $z'/z$ to the overall approximation factor, i.e., the output of PA has an exact score that is at least $\frac{z'}{z(1+\epsilon)}$ times the exact score of the optimal product.

Proof of Part 2: To show that PA is a polynomial time algorithm, the main task is to show that the compressed lists are always polynomial in length. We first observe that probability quantities such as $\Pr(a_i \mid T_j)$ are rational numbers, where both the numerator and the denominator are integers bounded by $n$ (i.e., the number of products in the dataset). From the score equation, note that the score of a product involves $m$ such probability quantity multiplications, where $m$ is the number of attributes. Therefore, the score of any product for any single tag can be represented as a rational number, where the numerator and denominator are integers bounded by $O(n^m)$. Thus, we can normalize each such score into an integer by multiplying it with $O(n^m)$.

Next, consider a $z'$-dimensional cube with each side of length $L = O(n^m)$. We partition the cube into $z'$-dimensional cells as follows: Along each axis, start with the furthest value $L$, and then proceed towards the origin by marking the points $L/(1 + \sigma)$, $L/(1 + \sigma)^2$, and so on. The number of points marked along each axis is $\log_{1+\sigma} L = O(m \log_{1+\sigma} n)$, which is a polynomial in $m$ and $n$. Then at each marked point we pass $(z' - 1)$-dimensional hyperplanes perpendicular to the corresponding axis. Their intersections create $O(\mathrm{poly}(m, n)^{z'})$ cells within the cube $L^{z'}$.

Due to this skewed method of partitioning the cube into cells, the cells that are further away from the origin are larger. Consider the $i$-th iteration of the PTAS algorithm. Each product in $S_i$ may be represented as a point in this cube. Though within any cell there may be several points corresponding to products of $S_i$, after compression there can be at most one point corresponding to a product of $S'_i$, because two or more points could not have survived the compression process. The length of any compressed list in the PTAS algorithm is therefore at most $O(\mathrm{poly}(m, n)^{z'})$. When $z'$ is a constant, this translates to an overall polynomial running time for PA. □
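To get a feel for the size of this bound, the sketch below counts the marks per axis and the resulting number of cells, taking $L = n^m$ and ignoring the constants hidden in the $O(\cdot)$ notation. The function names and sample parameters are assumptions chosen for illustration only.

```python
import math


def marks_per_axis(n, m, sigma):
    """Number of geometric marks L, L/(1+sigma), ... down to 1 on one axis,
    taking L = n**m as the normalised side length."""
    return math.ceil(m * math.log(n) / math.log(1 + sigma))


def max_compressed_size(n, m, sigma, z_prime):
    """Upper bound on surviving points: at most one per grid cell."""
    return marks_per_axis(n, m, sigma) ** z_prime


# sigma = eps / 2m with eps = 0.8, for two tags per group (z' = 2).
small = max_compressed_size(n=1000, m=16, sigma=0.8 / (2 * 16), z_prime=2)
large = max_compressed_size(n=1000, m=40, sigma=0.8 / (2 * 40), z_prime=2)
```

Under these assumptions the cell bound exceeds $2^{16}$ for $m = 16$ but is far below $2^{40}$ for $m = 40$, so the polynomial guarantee only becomes favourable relative to full enumeration as $m$ grows.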

Extending from Top-1 to Top-k: Our PA algorithm can be modified to return the top-k products instead of just the best product. For the tag group $T_r$, once a set of products $S_i$ is built, we compress it to form the set $S'_i$. However, every time a cluster representative is selected, instead of deleting all the remaining points in the cluster, we remember $k - 1$ products within the cluster and associate them with the cluster representative (and if the cluster has fewer than $k$ products, we remember and associate all the products with the cluster representative).

When all the $m$ iterations are completed, we can return the top-k products as follows: we first return the best product of $S'_m$ along with the $k - 1$ products associated with it. If the number of associated products is less than $k - 1$, the second best cluster representative of $S'_m$ and the set of products associated with it are returned, and so on.

When the approximate top-k products from all tag groups have been returned, the main algorithm returns the overall best top-k products from among them. It can be shown that this approach guarantees an approximation factor for the score of the top-k products returned.

Grouping of Tags: The PA algorithm partitions the set of tags into constant-sized groups. We can employ techniques similar to the grouping of attributes for the ETT algorithm in order to group related tags together in a principled fashion. However, the bounds and properties of the PA algorithm are not affected by this.

5.2. HC: Hill-Climbing Algorithm 



Algorithm 3 HC (Naive Bayes probabilities): top-k local optimal products

1: localOptimaFound $\leftarrow$ false
2: Randomly generate a product $o_s$ (boolean vector) of length $m$
3: while localOptimaFound is false do
4:   for $i = 1$ to $m$ do
5:     $o_i \leftarrow$ $i$-th product in Neighbors($o_s$)
6:     Add ExactScore($o_i$) to NeighborScore($o_s$)
7:   end for
8:   if Max(NeighborScore($o_s$)) > ExactScore($o_s$) then
9:     $o_s \leftarrow o_i$ // $o_i$ has the highest score in NeighborScore($o_s$)
10:   else
11:     localOptimaFound $\leftarrow$ true
12:   end if
13: end while
14: return $o_s$



Our second approximation algorithm (HC) is based on the generic hill-climbing heuristic, often used for solving complex optimization problems. The algorithm starts from a random solution to the problem (starts at the base of the hill) and then repeatedly improves the solution (walks up the hill) until no further improvement is possible (the top of a local hill is reached). In the light of our framework, we generate a random product, i.e., a boolean vector of size $m$. At every climbing step, we check all its immediate neighboring products by examining whether a single bit can be flipped to improve the score of the product (by Equation 5). If there exists such a neighboring product, we proceed to that neighboring product and repeat the climbing step, until a local maximum is reached. Multiple ($k$ or more) random restarts of the hill-climbing technique may lead us to multiple locally optimal products, of which the top-k may be returned.

Algorithm 3 contains details of our HC algorithm. We illustrate the HC algorithm with an example.

Example 5.4. Consider the boolean dataset in Table III of 8 products, each entry having 4 attributes and 2 tags. We first generate a random product, 1010, whose score is 1.70. The immediate neighbors of 1010 are products 0010, 1110, 1000 and 1011, having scores 1.46, 1.77, 0.89 and 1.76 respectively. Since the score of neighbor 1110 exceeds that of the starting product 1010, we climb to 1110. The neighbors of 1110, namely 0110, 1010, 1100 and 1111, all have lower scores than 1110, and hence we terminate with 1110 as the locally optimal product (for this example, it also happens to be the globally optimal product). □
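A single climb can be sketched as follows, with products represented as bit strings. The scoring callback is an assumption of the sketch: it uses the scores quoted in Examples 5.1 and 5.4 and, purely so the toy run is self-contained, treats any product whose score is not quoted as having score 0.0. The names `hill_climb` and `random_start` are hypothetical.

```python
import random


def hill_climb(start, score):
    """Single-restart hill climbing over boolean products (bit strings)."""
    current = start
    while True:
        # All products that differ from the current one in exactly one bit.
        neighbors = [current[:i] + ("0" if b == "1" else "1") + current[i + 1:]
                     for i, b in enumerate(current)]
        best = max(neighbors, key=score)
        if score(best) > score(current):
            current = best          # climb to the best-scoring neighbor
        else:
            return current          # no neighbor improves: local optimum


def random_start(m, rng):
    """Random boolean product of length m, for use with random restarts."""
    return "".join(rng.choice("01") for _ in range(m))


# Scores quoted in Examples 5.1 and 5.4; unlisted products default to 0.0
# (an assumption made only for this toy run).
known = {"1010": 1.70, "0010": 1.46, "1110": 1.77, "1000": 0.89,
         "1011": 1.76, "1111": 1.75, "1100": 0.89}
local_opt = hill_climb("1010", lambda o: known.get(o, 0.0))
```

Because a move is made only on strict improvement, each climb terminates. Repeating it from several `random_start` products and keeping the best k results corresponds to the random-restart variant described above.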

The hill-climbing heuristic is simple, and as we shall discuss later, was found to be extremely effective in our experiments even for large problem instances. However, it has several theoretical limitations. As the following theorem shows, there is no guarantee that it can produce the global optimum; moreover, the score of the globally optimal product may be exponentially better than the score of a locally optimal product.

THEOREM 5.5. There exists a boolean dataset with $m$ attributes and $z$ tags, and two potential products $o_l$ and $o_g$, where $o_l$ and $o_g$ are a local and the global optimum respectively, such that $\frac{ExactScore(o_g)}{ExactScore(o_l)} = \Omega(n^m)$.

Proof: The exact score of the globally optimum, i.e., the best, product (from Equation 5) can at most be 1, since $\frac{\Pr(a_i \mid T_j')}{\Pr(a_i \mid T_j)} \in [\frac{1}{n}, n]$ and $n$ and $m$ are usually high, so that $\frac{1}{1 + 1/n^m} \approx 1$. The local maximum that the HC heuristic may return in the worst case is the product with the lowest score. From the same equation, the exact score of the worst-case locally optimum product is $\frac{1}{1 + n^m} \approx \frac{1}{n^m}$ when $n^m$ is high. Therefore, $\frac{ExactScore(o_g)}{ExactScore(o_l)} \ge \frac{1}{1/n^m} = n^m$, i.e., $\frac{ExactScore(o_g)}{ExactScore(o_l)} = \Omega(n^m)$.

Hence, there exists a dataset for which the expected number of tags of a globally optimum product can be exponentially larger than that of a locally optimum product. □

A second limitation of HC is that even for finding a locally optimal product, there is no guarantee on the time taken for termination, and sometimes the convergence may be very slow. This is because certain bits may have to be flipped over and over again, thus forcing the algorithm to explore a huge part of the $2^m$ search space of potential products.

6. EXPERIMENTS 

We conduct a set of comprehensive experiments using both synthetic and real datasets for quantitative and qualitative analysis of our proposed algorithms. Our quantitative performance indicators are (a) the efficiency of the proposed exact and approximation algorithms, and (b) the approximation factor of the results produced by the approximation algorithms. The efficiency of our algorithms is measured by the overall execution time and the number of products that are considered from the pool of all possible products, whereas the approximation factor is measured as the ratio of the acquired approximate result score to the actual optimal result score. We also conduct a user study through Amazon Mechanical Turk and present case studies to qualitatively assess the results of our algorithms.

System configuration: Our prototype system is implemented in Java with JDK 5.0. All experiments were conducted on a Windows XP machine with a 3.0 GHz Intel Xeon processor and 2 GB RAM. The JVM size is set to 512 MB. All numbers are obtained as the average over three runs.

Real Camera Dataset: We crawl a real dataset of 100 cameras⁵ listed at Amazon (http://www.amazon.com). The products contain technical details (attributes), besides the tags customers associate with each product. The tags are cleaned by domain experts to remove synonyms as well as unintelligible and undesirable tags such as nikon coolpix, quali, bad, etc. Since the camera information crawled from Amazon lacks well-defined attributes, we look up Google Products (http://www.google.com/products) to retrieve a rich collection of technical specifications for each product. Each product has 40 boolean attributes, such as self-timer, face-detection, red-eye fix, etc., while the tag dictionary includes 40 unique keywords like lightweight, advanced, easy, etc.

⁵ As discussed earlier, the number of products in the dataset is not important for the execution cost; see the analysis in Figure 9.
Real Car Dataset: We crawl another real dataset from Yahoo! Autos (http://autos.yahoo.com/). We focus on new cars listed for the year 2010, spanning 34 different brands. There are several models for each brand, and each model offers several trims.⁶ Since each trim defines a unique attribute-value specification, the 606 trims that we crawl are the products in our dataset. The products contain technical specifications as well as ratings and reviews, which include pros and cons. We parse a total of 60 attributes: 25 numeric, and 35 boolean and categorical (which we generalize to boolean), such as air-conditioning, sunroof, etc. The total number of reviews we extract is 2180. We extract tags from the reviews using the keyword extraction toolkit AlchemyAPI (http://www.alchemyapi.com/api/keyword/). We process the text listed under pros in each review to identify a set of 15 desirable tags such as fuel economy, comfortable interior and stylish exterior. A car is assigned a tag if one of its reviews contains that keyword.

Synthetic Dataset: We generate a large boolean matrix of dimension 10,000 (products) x 100 (50 attributes + 50 tags) and randomly choose submatrices of varying sizes, based on our experimental setting. We split the 50 independent and identically distributed attributes into four groups, where the value is set to 1 with probabilities of 0.75, 0.15, 0.10 and 0.05 respectively. For each of the 50 tags, we pre-define relations by randomly picking a set of attributes that are correlated to it. A tag is set to 1 with probability $p$ if a majority of the attributes in its pre-defined relation are 1. For example, assume tag $T_1$ is defined to depend on three attributes, including $A_{13}$ and $A_{25}$. $T_1$ is set to 1 with a probability of 0.67 if 2 of the three attributes are 1.
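The generation procedure can be sketched as below. Several details are assumptions not stated in the text: the four attribute groups are taken to be equal-sized and consecutive, every tag depends on three randomly chosen attributes, and a tag is 0 whenever the majority condition fails. The function name `make_synthetic` and its default parameters are hypothetical.

```python
import random


def make_synthetic(n, m, z, p=0.67, relation_size=3, seed=0):
    """Generate n boolean rows of m attributes followed by z tags."""
    rng = random.Random(seed)
    group_probs = (0.75, 0.15, 0.10, 0.05)
    # Assumption: attributes are split into four equal consecutive groups.
    attr_prob = [group_probs[min(4 * i // m, 3)] for i in range(m)]
    # Each tag depends on a randomly picked set of attributes.
    relations = [rng.sample(range(m), relation_size) for _ in range(z)]
    rows = []
    for _ in range(n):
        attrs = [1 if rng.random() < q else 0 for q in attr_prob]
        tags = []
        for rel in relations:
            majority = 2 * sum(attrs[a] for a in rel) > len(rel)
            # Assumption: the tag is 0 when the majority condition fails.
            tags.append(1 if majority and rng.random() < p else 0)
        rows.append(attrs + tags)
    return rows, relations


rows, relations = make_synthetic(n=200, m=8, z=3)
```

Submatrices of the desired size can then be drawn from `rows` by selecting subsets of rows and columns.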

We use the synthetic datasets for quantitative experiments, while the real data are 
used in user and case study. 

6.1. Quantitative Results: Performance 

Exact Algorithm: We first compare the Naive approach with our ETT. Since the Naive algorithm can only work for small problem instances, we pick a subset from the synthetic dataset having 1000 products, 16 attributes and 8 tags. Figures 4 and 5 compare the execution time and the number of candidate products considered, respectively, for Naive and ETT when the number of attributes ($m$) varies (number of products = 1000, number of tags = 8). Note that the number of products considered by ETT is the number of products created in tier-2 by joining products from tier-1. The Naive algorithm considers all $2^m$ products. We used $m' = 2, 2, 4, 5, 4, 7, 4, 6$ attributes per group for $m = 4, 6, 8, 10, 12, 14, 16, 18$ respectively in ETT (more analysis of $m'$ in Figures 6 and 7). As can be seen, Naive is orders of magnitude slower than ETT.

Next, we study the effect of attribute groupings on ETT. For a sub-sample picked from our synthetic dataset having 20 attributes, 1000 products and 8 tags, we experiment with different possible attribute groupings, $m' = 1, 2, 4, 5, 10, 20$. Figures 6 and 7 show the effect of $m'$ on the performance of ETT when attributes are grouped arbitrarily. The execution time and number of products considered for $m' = 1$ are not reported in Figures 6 and 7 as it was too slow. The trade-off in choosing $m'$ is: a small $m'$ means there are many short lists in tier-1, so that the cost of joining the lists is high. In contrast, a large $m'$ indicates fewer but longer lists in tier-1, resulting in an increased cost of creating the lists. We observe that the best balance is struck when $m' = 4$ attributes, forming 5 lists, each having $2^4 = 16$ products.

⁶ Trims denote different configurations of standard equipment.

We also employed the grouping of attributes technique in Section 4.1 to partition the set of 20 attributes, and investigated whether the execution time and number of products considered improve (i.e., decrease). We create a graph of 20 nodes (corresponding to the 20 attributes) and $\binom{20}{2} = 190$ edges. We use the absolute value of the Pearson correlation to determine the edge weight because even if two attributes are anti-correlated, they should be grouped together, and the 1 from one will be combined with the 0 from the other to create a high-scored entry (we use 0.5 for don't care). Then, we employ a graph partitioning algorithm for partitioning the 20 attributes into as many groups as the desired number of lists. Specifically, we use the publicly available software METIS-4.0 (http://glaros.dtc.umn.edu/gkhome/views/metis) for partitioning the attributes. We observe that when we partition the 20 attributes into 4 clusters (i.e., $m' = 5$ attributes forming 4 lists), the execution time is lower than the execution time in the case of arbitrary grouping of attributes (Figure 6); the number of products looked up by ETT decreases from 3639 to 1713 (note that the number of products looked up by Naive for this data is 1048576). In contrast, if we partition the 20 attributes into 5 clusters (i.e., $m' = 4$ attributes forming 5 lists), the execution time and the number of products looked up by ETT remain the same as in the case of arbitrary grouping of attributes (Figure 6). This is because clustering the attributes into 4 groups generated a stronger partitioning (i.e., attributes grouped together have stronger correlation) than clustering the attributes into 5 groups. Therefore, if the data is highly correlated and yields well-defined clusters or partitions, ETT benefits significantly from principled grouping of attributes.

Next, we vary the number of tags $z$ and the number of products $n$ in the dataset to study the behavior of ETT. We pick a subset from the synthetic dataset having 1000 products, 16 attributes and 16 tags, and consider further subsets of this dataset. Figure 8 reflects the change in execution time with an increasing number of tags for the synthetic data (number of products = 1000, number of attributes = 12, attribute grouping = 3). The increase in the number of tags increases the number of GetNext() operations in ETT, and hence the running time rises steadily. Figure 9 depicts how an increase in the number of products in the dataset (number of attributes = 12, number of tags = 8, attribute grouping = 3) barely affects the running time of ETT, since an initialization step calculates all conditional tag-attribute probabilities.

Approximation Algorithms: We observe in Figure [4] that ETT executes faster than 
Naive for moderate data instances. However, ETT is extremely slow beyond number 
of attributes (m) = 16, which makes it unsuitable for large real-world datasets having 
many attributes and tags. Therefore, we move to our approximation algorithms HC 
and PA, and compare their execution time and sub-optimal product score in 
Table [IV]. We pick three different subsets of 1000 products from the synthetic 
dataset: (number of attributes = 8, number of tags = 4), (number of attributes = 12, 
number of tags = 8) and (number of attributes = 16, number of tags = 8). We execute 
the PA algorithm at an approximation factor of 0.8 and the HC algorithm without 
multiple random restarts. The execution time of PA for a moderately large dataset 
(1000 products, 16 attributes and 12 tags) indicates that it is unlikely to scale to large 
(real) datasets. Nevertheless, it is the only algorithm of the two that provides 
worst-case guarantees on both time complexity and result quality. On the other hand, 
our HC algorithm is quite effective even as the number of attributes and tags increases. 
For the dataset (1000 products, 16 attributes and 12 tags), HC is 10^4 times faster than 
PA, while the quality of the sub-optimal product remains comparable. Table [IV] shows 
a situation (n=1000, m=8, z=4) in which HC takes a similar amount of time as PA to 
retrieve an identical sub-optimal product, and another situation (n=1000, m=12, z=8) 
in which HC takes less time than PA but retrieves a sub-optimal product inferior in 
quality to that returned by PA. However, multiple random restarts of HC may lead us 
to better sub-optimal product(s). 
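
To make the role of random restarts concrete, the following is a minimal sketch of hill climbing over Boolean attribute vectors, not the exact HC implementation evaluated above. It assumes that a product's score is the sum of the per-tag Naive Bayes posteriors (the expected number of desirable tags), that each classifier is given as a hypothetical triple (prior, cond_true, cond_false) of smoothed probabilities, and that a move flips a single attribute.

```python
import math
import random


def tag_probability(product, prior, cond_true, cond_false):
    """Naive Bayes posterior Pr(tag | product) for Boolean attributes.

    cond_true[i][v] is Pr(attribute i = v | tag present), cond_false[i][v]
    the same given the tag is absent; all values are assumed to lie in (0, 1).
    """
    log_yes = math.log(prior)
    log_no = math.log(1.0 - prior)
    for i, v in enumerate(product):
        log_yes += math.log(cond_true[i][v])
        log_no += math.log(cond_false[i][v])
    # Normalise the two joint scores into a posterior probability.
    return 1.0 / (1.0 + math.exp(log_no - log_yes))


def score(product, classifiers):
    """Assumed objective: expected number of desirable tags."""
    return sum(tag_probability(product, *c) for c in classifiers)


def hill_climb(start, classifiers):
    """Flip one attribute at a time, keeping any flip that raises the score."""
    current = list(start)
    best = score(current, classifiers)
    improved = True
    while improved:
        improved = False
        for i in range(len(current)):
            current[i] ^= 1  # tentatively flip attribute i
            s = score(current, classifiers)
            if s > best:
                best, improved = s, True
            else:
                current[i] ^= 1  # undo a flip that did not help
    return current, best


def hill_climb_with_restarts(num_attributes, classifiers, restarts=10, seed=0):
    """Run hill climbing from several random starts and keep the best result."""
    rng = random.Random(seed)
    best_product, best_score = None, float("-inf")
    for _ in range(restarts):
        start = [rng.randint(0, 1) for _ in range(num_attributes)]
        product, s = hill_climb(start, classifiers)
        if s > best_score:
            best_product, best_score = product, s
    return best_product, best_score
```

Each run stops at a local optimum, so a single run can return an inferior product, as in the (n=1000, m=12, z=8) case above; restarts trade extra running time for a better chance of escaping such optima.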

6.2. Qualitative Results: User Study 

We now validate, through a user study conducted on Amazon Mechanical Turk 
(https://www.mturk.com) on the real camera dataset, how designers can leverage 
existing product information to design new products catering to different groups of 
people. We also consult DPreview (http://www.dpreview.com), a website about digital 
cameras and digital photography. There are two parts to our user study. Each part of 
the study involves thirty independent single-user tasks. Each task is conducted in two 
phases: a User Knowledge Phase, where we estimate the users' background, and a 
User Judgment Phase, where we collect users' responses to our questions. 

In the first part of our study, we build four new cameras (two digital compact and two 
digital SLR) using our HC algorithm by considering tag sets corresponding to compact 
cameras and SLR cameras respectively. We present these four new cameras along with 
four existing popular cameras (presented anonymously) and observe that 65% of users 
choose the new cameras over the existing ones. For example, users overwhelmingly 
prefer our new compact digital camera over the Nikon Coolpix L22 because the former 
supports both automatic and manual focus while the latter does not, thus validating 
how our techniques can benefit designers. 

Fig. 10. Users Classify Cameras Correctly (chart title: User Study; x-axis: Camera) 

The second part of the study concerns six new cameras designed for three groups 
of people: young students, older retirees and professional photographers. Domain 
experts identify and label three overlapping sets of tags from the camera dataset's 
complete tag vocabulary, one set for each group, and we then build two potential new 
cameras for each of the three groups. For each of the six new cameras thus built, we 
ask users to assign at least five tags by looking up the complete camera tag vocabulary 
provided to them. We observe that a majority of the users correctly classify the six 
cameras into the three groups. The correctness of the classification is validated by 
comparing the tags received for a camera to the three tag sets identified by domain 
experts; we also validate the correctness by consulting data available on DPreview. 
For example, the cameras designed by leveraging tags corresponding to professional 
photographers draw tags like advanced, high iso, etc., while cameras designed by 
leveraging tags corresponding to older retirees draw tags like lightweight, easy, etc. 
Figure [10] shows the percentage of users classifying the six cameras correctly. Thus, 
our technique can help designers build new products that are likely to attract desirable 
tags from different groups of people. 
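
The comparison of user-assigned tags against the expert tag sets can be sketched as follows. This is one plausible reading of the validation step, not the exact procedure used in the study: a response is attributed to the group whose expert tag set it overlaps most. The function names and most of the tags in GROUP_TAGS are hypothetical; only advanced, high iso, lightweight and easy are taken from the text.

```python
def classify_by_tags(user_tags, group_tag_sets):
    """Return the group whose expert tag set overlaps most with user_tags.

    Ties are broken in favour of the group listed first in group_tag_sets.
    """
    tags = set(user_tags)
    return max(group_tag_sets, key=lambda g: len(tags & group_tag_sets[g]))


def fraction_correct(responses, intended_group, group_tag_sets):
    """Fraction of user responses attributed to the intended group."""
    hits = sum(
        classify_by_tags(r, group_tag_sets) == intended_group for r in responses
    )
    return hits / len(responses)


# Illustrative, overlapping tag sets (mostly invented for this sketch).
GROUP_TAGS = {
    "students": {"cheap", "compact", "easy", "lightweight"},
    "retirees": {"lightweight", "easy", "simple", "large buttons"},
    "professionals": {"advanced", "high iso", "manual", "raw"},
}
```

Because the expert tag sets overlap, a response can match several groups; taking the largest overlap is a simple way to resolve this, and other rules (for example, a minimum overlap threshold) would be equally defensible.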



6.3. Qualitative Results: Case Study 

We present a few interesting anecdotal results returned by our framework on the real 
car dataset to validate that our algorithms help us draw interesting conclusions about 
the desirability of certain car specifications (attribute values). Our HC algorithm 
indicates that features such as child safety door locks, 4-wheel anti-lock brakes, 
AM/FM radio, keyless entry, telescoping steering wheel, compass, air filter, trunk 
light, smart coin holder and cup holder are likely to elicit positive feedback from 
customers (i.e., they are the features that maximize the set of desirable tags for the 
real car dataset). When we design new cars using as our training set only those car 
instances that have received the tag economy, we observe that some luxury features 
like heated seats, in-dash CD changer system, sunroof/moonroof and leather 
upholstery are returned by our framework. This indicates that users prefer selective 
luxury features when buying economy cars. Also, sports cars designed using our 
algorithm (using as our training set only those car instances that have received the 
tag sports) are found to contain safety features, thereby indicating that safety features 
have become a high-priority requirement for users buying sports cars. 

7. RELATED WORK 

Tag prediction: The dynamics of social tagging has been an active research area 
in recent years, with several papers focusing on the tag prediction problem. A 
recent work [Yin et al. 2010] proposes a probabilistic model for personalized tag 
prediction and employs the Naive Bayes classifier. Related research in text mining 
[Pak and Paroubek 2010] found that the Naive Bayes classifier performs better than 
SVM and CRF in classifying blog sentiments. Another study that indirectly supports 
the use of Naive Bayes for tag prediction is by Heymann et al. [Heymann et al. 2008], 
who found that tag-based association rules can produce very high-precision 
predictions. The process of collaborative tagging has been studied in 
[Golder and Huberman 2006] and [Golder and Huberman 2005]; the challenges 
associated with tag recommendations for collaborative tagging systems have been 
discussed in [Jäschke et al. 2012]. [Kim et al. 2010] develops a recommendation 
algorithm that uses the collaborative tags of users to provide enhanced 
recommendation quality and to overcome some of the limitations of collaborative 
filtering systems. Other related work investigates tag suggestion, usually from a 
collaborative filtering and UI perspective; for example, with images [Wu et al. 2013] 
and blog posts [Mishne 2006]. Due to the high popularity of social bookmarking 
systems, [Michlmayr 2007] proposes a technique for building a user profile from a 
user's tagging behaviour, thereby indicating the usefulness of tags in representing 
user opinion. 

Item design: The problem of item design has been studied by many disciplines 
including economics, industrial engineering and computer science 
[Selker and Burleson 2000]. Optimal item design or positioning is a well-studied 
problem in Operations Research and Marketing. Shocker and Srinivasan 
[Shocker and Srinivasan 1974] first represented products and consumer preferences 
as points in a joint attribute space. Later, several techniques 
[Albers and Brockhoff 1980], [Albritton and McMullen 2007] were developed to 
design/position a new item. Work in this domain requires direct involvement of 
consumers, who choose preferences from a set of existing alternative products. 
[Harding et al. 2006] reviews the role of data mining in manufacturing engineering, 
in particular production processes, operations, fault detection, maintenance, 
decision support, and product quality improvement. Tucker proposes a machine 
learning model to capture emerging customer preference trends within the market 
space in [Tucker 2011]. Miah et al. [Miah et al. 2009] study the problem of selecting 
product snippets given a user query log, in order for the designed snippet to be 
returned by the maximum number of queries. Our problem is different because the 
tags of a product are correlated to its attributes (through a classifier), whereas the 
queries in [Miah et al. 2009] are boolean selection conditions. However, none of these 
works has studied the problem of item design in relation to social collaborative 
tagging. 

Top-k algorithms: Our top-k pipelined algorithm is inspired by the rich work on 
top-k algorithms [Fagin et al. 2001], [Ilyas et al. 2009]. A recent survey by Ilyas et 
al. [Ilyas et al. 2008] covers many of the important results in this area. The classic 
setting of these works is that each list contains an attribute of an object and a 
monotone aggregate function is used for ranking. The top tier of our pipelined top-k 
algorithm is adapted from this setting, where each list has the probability of a tag for 
each assignment of attribute values. In contrast, in the bottom tier of our algorithm, 
an entry from one list can match with any entry from the other lists. This setting is 
adapted from the problem of top-k join [Ilyas et al. 2009]. 
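
For readers unfamiliar with the classic setting described above, the following is a minimal sketch of a threshold-style top-k algorithm in the spirit of [Fagin et al. 2001]. It illustrates only the classic setting, not our two-tier pipelined algorithm. It assumes that every object appears in every list, that random access to scores is available, and that the aggregate function is monotone (summation by default); the function name threshold_top_k is hypothetical.

```python
import heapq


def threshold_top_k(lists, k, aggregate=sum):
    """Return the k objects with the highest aggregate score.

    lists: one dict per attribute mapping object id -> score; every object
           is assumed to appear in every dict.
    aggregate: a monotone function over an iterable of per-list scores.
    """
    # Sorted access: each list ordered by decreasing score.
    sorted_lists = [sorted(l.items(), key=lambda kv: -kv[1]) for l in lists]
    seen = set()
    top = []  # min-heap of (aggregate score, object id), at most k entries
    for depth in range(len(sorted_lists[0])):
        last_scores = []
        for sorted_list in sorted_lists:
            obj, s = sorted_list[depth]
            last_scores.append(s)
            if obj not in seen:
                seen.add(obj)
                # Random access: fetch the object's score in every list.
                total = aggregate(l[obj] for l in lists)
                if len(top) < k:
                    heapq.heappush(top, (total, obj))
                elif total > top[0][0]:
                    heapq.heapreplace(top, (total, obj))
        # No unseen object can score above the aggregate of the last scores
        # read under sorted access, because the aggregate is monotone.
        threshold = aggregate(last_scores)
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True)
```

The early stop is what distinguishes this from a full scan: once the k-th best aggregate reaches the threshold, the remaining list entries need not be read.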

8. CONCLUSIONS 

In this paper we consider the novel problem of leveraging online collaborative tagging 
in product design. We formally define the Tag Maximization problem, investigate its 
computational complexity, and propose several principled algorithms that are shown to 
work well in practice. Our work is a preliminary look at a novel area of research, 
and there appear to be many exciting directions for future research. Our immediate 
focus is to extend our work to include tag prediction using other classifiers, such as 
decision trees, SVMs, and regression trees (the latter are applicable when we wish to 
predict the frequency of occurrence of desirable tags attracted by products). We also 
intend to evaluate the applicability of our proposed framework to other novel 
applications, e.g., guiding recommender systems to recommend better vacation travel 
itineraries by tracking tag history, helping online authors write better blogs, and 
others. 